Contingency Matrix (sklearn.metrics.cluster.contingency_matrix)#
The contingency matrix is a count table that summarizes how two labelings overlap. In clustering evaluation it is commonly built from:
labels_true: ground-truth classes (if available)
labels_pred: predicted cluster IDs (arbitrary up to permutation)
It is the fundamental object behind many external clustering scores (purity, mutual information, Rand index, …).
Quick import#
from sklearn.metrics.cluster import contingency_matrix
Learning goals#
Define the contingency matrix with clear math notation
Implement it from scratch in NumPy
Visualize and interpret it (raw counts + normalized views)
Use it to align cluster IDs to classes and compute simple derived scores (purity, “best-mapped” accuracy)
Use it for a simple hyperparameter search (choose k for k-means) when ground truth exists
import itertools
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
from scipy import sparse as sp
from scipy.optimize import linear_sum_assignment
from sklearn.metrics.cluster import contingency_matrix
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Definition#
Let there be \(n\) items. For each item \(i\) we have two labels:
a reference label (often ground truth) \(y_i\)
a predicted label (often a cluster ID) \(z_i\)
Assume the unique values in \(y\) are \(\mathcal{C} = \{c_1,\dots,c_C\}\) and in \(z\) are \(\mathcal{K} = \{k_1,\dots,k_K\}\).
The contingency matrix \(N \in \mathbb{R}^{C \times K}\) is defined by:
\[N_{a,b} = \#\{\, i : y_i = c_a \text{ and } z_i = k_b \,\}.\]
Interpretation:
Row \(a\) describes how true class \(c_a\) is split across clusters.
Column \(b\) describes which true classes appear inside cluster \(k_b\).
Useful marginals:
\[n_a = \sum_b N_{a,b} \quad (\text{size of class } c_a), \qquad m_b = \sum_a N_{a,b} \quad (\text{size of cluster } k_b),\]
and \(\sum_{a,b} N_{a,b} = n\).
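These marginals are one-liners in NumPy. A quick sanity check on a toy table (the variable names here are mine, chosen to match the math above):

```python
import numpy as np

# Toy contingency matrix N (3 classes x 3 clusters)
N = np.array([
    [0, 2, 0],
    [2, 0, 1],
    [0, 0, 2],
])

n_a = N.sum(axis=1)  # row marginals: size of each true class
m_b = N.sum(axis=0)  # column marginals: size of each cluster
n = N.sum()          # total number of items

print(n_a)  # [2 3 2]
print(m_b)  # [2 2 3]
print(n)    # 7
```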
Probabilistic view#
Dividing by \(n\) gives the empirical joint distribution:
\[P_{a,b} = \frac{N_{a,b}}{n} \approx P(y = c_a,\ z = k_b).\]
Many information-theoretic clustering metrics (e.g. mutual information) are computed from \(P\) (plus its row/column marginals).
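As a sketch of that connection, mutual information can be computed straight from the counts via \(P\) and its marginals. This helper is an illustration of the definitions above, not the scikit-learn implementation:

```python
import numpy as np

def mutual_information_from_counts(N):
    """MI(y, z) in nats, computed from a contingency matrix of counts."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    P = N / n                              # empirical joint P(y=c_a, z=k_b)
    p_row = P.sum(axis=1, keepdims=True)   # marginal P(y=c_a)
    p_col = P.sum(axis=0, keepdims=True)   # marginal P(z=k_b)
    mask = P > 0                           # skip zero cells (0 * log 0 := 0)
    return float(np.sum(P[mask] * np.log(P[mask] / (p_row @ p_col)[mask])))

N = np.array([[0, 2, 0], [2, 0, 1], [0, 0, 2]])
print(mutual_information_from_counts(N))  # ≈ 0.8062 nats
```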
eps in scikit-learn#
sklearn.metrics.cluster.contingency_matrix has an optional eps argument that adds a constant to every entry:
\[\tilde{N}_{a,b} = N_{a,b} + \varepsilon.\]
This is mainly a numerical trick to avoid log(0) / NaN propagation in downstream computations.
It changes the counts, so use it intentionally.
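A quick demonstration on a small toy labeling (the same labels reappear later in this notebook):

```python
from sklearn.metrics.cluster import contingency_matrix

labels_true = ["cat", "cat", "dog", "dog", "dog", "fish", "fish"]
labels_pred = [1, 1, 0, 0, 2, 2, 2]

cm = contingency_matrix(labels_true, labels_pred)
cm_eps = contingency_matrix(labels_true, labels_pred, eps=0.5)

print(cm)      # integer counts, contains zeros
print(cm_eps)  # every entry shifted by 0.5, so log() is safe everywhere
```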
2) NumPy implementation (dense, plus an optional sparse variant)#
Below is a small from-scratch implementation.
Notes:
scikit-learn returns only the matrix; here we also return the unique label values so we can label axes in plots.
sparse=True returns a CSR matrix (useful when \(C \times K\) is huge but only a few entries are non-zero).
def contingency_matrix_np(labels_true, labels_pred, *, eps=None, sparse=False, dtype=np.int64):
"""Build a contingency matrix N where N[a,b] = #{i : y_i=c_a and z_i=k_b}.
Returns
-------
cm : ndarray or scipy.sparse.csr_matrix of shape (n_true, n_pred)
true_labels : ndarray of shape (n_true,)
Sorted unique values from labels_true (np.unique ordering).
pred_labels : ndarray of shape (n_pred,)
Sorted unique values from labels_pred (np.unique ordering).
"""
y = np.asarray(labels_true)
z = np.asarray(labels_pred)
if y.shape != z.shape:
raise ValueError(f"labels_true and labels_pred must have the same shape; got {y.shape} vs {z.shape}")
y = y.ravel()
z = z.ravel()
true_labels, y_inv = np.unique(y, return_inverse=True)
pred_labels, z_inv = np.unique(z, return_inverse=True)
n_true = true_labels.size
n_pred = pred_labels.size
if sparse:
if eps is not None:
raise ValueError("eps must be None when sparse=True (matches scikit-learn behavior)")
data = np.ones_like(y_inv, dtype=dtype)
cm = sp.coo_matrix((data, (y_inv, z_inv)), shape=(n_true, n_pred), dtype=dtype).tocsr()
return cm, true_labels, pred_labels
cm = np.zeros((n_true, n_pred), dtype=dtype)
np.add.at(cm, (y_inv, z_inv), 1)
if eps is not None:
cm = cm.astype(np.float64, copy=False) + float(eps)
return cm, true_labels, pred_labels
labels_true = np.array(["cat", "cat", "dog", "dog", "dog", "fish", "fish"])
labels_pred = np.array([1, 1, 0, 0, 2, 2, 2])
cm_np, y_vals, z_vals = contingency_matrix_np(labels_true, labels_pred)
cm_sk = contingency_matrix(labels_true, labels_pred)
print("true label values:", y_vals)
print("pred label values:", z_vals)
print("\ncontingency (NumPy):\n", cm_np)
print("\ncontingency (sklearn):\n", cm_sk)
print("\nallclose:", np.allclose(cm_np, cm_sk))
cm_sparse, _, _ = contingency_matrix_np(labels_true, labels_pred, sparse=True)
print("\nsparse (csr) -> dense:\n", cm_sparse.toarray())
true label values: ['cat' 'dog' 'fish']
pred label values: [0 1 2]
contingency (NumPy):
[[0 2 0]
[2 0 1]
[0 0 2]]
contingency (sklearn):
[[0 2 0]
[2 0 1]
[0 0 2]]
allclose: True
sparse (csr) -> dense:
[[0 2 0]
[2 0 1]
[0 0 2]]
fig = px.imshow(
cm_np,
text_auto=True,
aspect="auto",
x=[str(v) for v in z_vals],
y=[str(v) for v in y_vals],
labels={"x": "predicted label / cluster", "y": "true label", "color": "count"},
title="Contingency matrix (raw counts)",
)
fig.update_layout(height=350)
fig.show()
3) Normalized views (row-wise vs column-wise)#
Raw counts are great for interpretability, but they scale with dataset size. Two common normalizations turn the table into conditional probabilities:
Row-normalized: \(N_{a,b} / n_a \approx P(z=k_b \mid y=c_a)\)
Column-normalized: \(N_{a,b} / m_b \approx P(y=c_a \mid z=k_b)\)
Row-normalization answers: “Given a true class, how does it get distributed over clusters?” Column-normalization answers: “Given a predicted cluster, what’s its composition?”
row_sums = cm_np.sum(axis=1, keepdims=True)
col_sums = cm_np.sum(axis=0, keepdims=True)
# out=zeros is needed: np.divide with `where` leaves unselected entries uninitialized
row_norm = np.divide(cm_np, row_sums, out=np.zeros(cm_np.shape, dtype=float), where=row_sums > 0)
col_norm = np.divide(cm_np, col_sums, out=np.zeros(cm_np.shape, dtype=float), where=col_sums > 0)
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Row-normalized: P(cluster | class)", "Column-normalized: P(class | cluster)"),
)
fig.add_trace(
go.Heatmap(
z=row_norm,
x=[str(v) for v in z_vals],
y=[str(v) for v in y_vals],
zmin=0,
zmax=1,
colorbar=dict(title="prob"),
),
row=1,
col=1,
)
fig.add_trace(
go.Heatmap(
z=col_norm,
x=[str(v) for v in z_vals],
y=[str(v) for v in y_vals],
zmin=0,
zmax=1,
showscale=False,
),
row=1,
col=2,
)
fig.update_layout(height=360, title="Same contingency table, different questions")
fig.show()
4) Label permutation (why clusters are different from classes)#
Cluster IDs have no semantic meaning: cluster 0 vs 1 is arbitrary.
If we relabel clusters, the contingency matrix columns simply permute.
This is why many clustering scores are label-permutation invariant. If you compare contingency matrices directly, make sure you’re comparing them in a consistent column order.
# Permute the predicted labels (swap the meaning of cluster IDs)
# Here z_vals are [0, 1, 2], so we can permute by an index mapping.
perm = np.array([2, 0, 1]) # 0->2, 1->0, 2->1
labels_pred_perm = perm[labels_pred]
cm_perm, _, z_vals_perm = contingency_matrix_np(labels_true, labels_pred_perm)
inv_perm = np.argsort(perm) # new->old
print("perm (old->new):", perm)
print("inv_perm (new->old):", inv_perm)
print("original pred label values:", z_vals)
print("permuted pred label values:", z_vals_perm)
print("cm_perm == cm_np[:, inv_perm] ?", np.allclose(cm_perm, cm_np[:, inv_perm]))
fig = make_subplots(rows=1, cols=2, subplot_titles=("Original", "After permuting cluster IDs"))
fig.add_trace(
go.Heatmap(z=cm_np, x=[str(v) for v in z_vals], y=[str(v) for v in y_vals], text=cm_np, texttemplate="%{text}"),
row=1,
col=1,
)
fig.add_trace(
go.Heatmap(
z=cm_perm,
x=[str(v) for v in z_vals_perm],
y=[str(v) for v in y_vals],
text=cm_perm,
texttemplate="%{text}",
showscale=False,
),
row=1,
col=2,
)
fig.update_layout(height=350, title="Permuting cluster IDs permutes contingency columns")
fig.show()
perm (old->new): [2 0 1]
inv_perm (new->old): [1 2 0]
original pred label values: [0 1 2]
permuted pred label values: [0 1 2]
cm_perm == cm_np[:, inv_perm] ? True
5) A 2D clustering example (synthetic data)#
We’ll create 3 Gaussian blobs with known labels (only for evaluation), run a simple NumPy k-means, and then visualize the resulting contingency matrix.
# Synthetic 2D dataset with 3 ground-truth classes
centers = np.array([
[-2.0, -2.0],
[0.0, 2.5],
[2.5, -0.5],
])
std = 0.7
n_per_class = 180
X = np.vstack([rng.normal(loc=c, scale=std, size=(n_per_class, 2)) for c in centers])
y_true = np.repeat(np.arange(len(centers)), n_per_class)
# Shuffle so class blocks are mixed
perm_idx = rng.permutation(X.shape[0])
X = X[perm_idx]
y_true = y_true[perm_idx]
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=y_true.astype(str),
title="Ground truth classes (for evaluation only)",
labels={"x": "x1", "y": "x2", "color": "true class"},
)
fig.show()
def kmeans_single_run(X, k, *, n_iters=50, rng=None):
"""Basic k-means (Lloyd's algorithm) for Euclidean distance.
Returns
-------
labels : ndarray of shape (n_samples,)
centers : ndarray of shape (k, n_features)
inertia : float
Sum of squared distances to assigned centers.
"""
if rng is None:
rng = np.random.default_rng()
X = np.asarray(X, dtype=float)
n_samples = X.shape[0]
# Init centers by sampling points
init_idx = rng.choice(n_samples, size=k, replace=False)
centers = X[init_idx].copy()
labels = np.zeros(n_samples, dtype=int)
for _ in range(n_iters):
# Assignment step
d2 = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2) # (n_samples, k)
new_labels = np.argmin(d2, axis=1)
# Update step
new_centers = centers.copy()
for j in range(k):
mask = new_labels == j
if not np.any(mask):
# Empty cluster -> re-init to a random point
new_centers[j] = X[rng.integers(0, n_samples)]
else:
new_centers[j] = X[mask].mean(axis=0)
if np.array_equal(new_labels, labels):
centers = new_centers
labels = new_labels
break
centers = new_centers
labels = new_labels
inertia = float(np.sum((X - centers[labels]) ** 2))
return labels, centers, inertia
def kmeans_np(X, k, *, n_iters=50, n_init=10, rng=None):
"""Run k-means with multiple random initializations and keep the best inertia."""
if rng is None:
rng = np.random.default_rng()
best_labels = None
best_centers = None
best_inertia = np.inf
for _ in range(n_init):
labels, centers, inertia = kmeans_single_run(X, k, n_iters=n_iters, rng=rng)
if inertia < best_inertia:
best_inertia = inertia
best_labels = labels
best_centers = centers
return best_labels, best_centers, float(best_inertia)
k = 3
labels_km, centers_km, inertia_km = kmeans_np(X, k, n_iters=60, n_init=8, rng=rng)
print("k-means inertia (SSE):", inertia_km)
fig = px.scatter(
x=X[:, 0],
y=X[:, 1],
color=labels_km.astype(str),
title="Predicted k-means clusters (k=3)",
labels={"x": "x1", "y": "x2", "color": "cluster"},
)
fig.add_trace(
go.Scatter(
x=centers_km[:, 0],
y=centers_km[:, 1],
mode="markers",
name="centers",
marker=dict(symbol="x", size=14, color="black"),
)
)
fig.show()
k-means inertia (SSE): 511.10262206779885
cm, y_vals, z_vals = contingency_matrix_np(y_true, labels_km)
fig = px.imshow(
cm,
text_auto=True,
aspect="auto",
x=[str(v) for v in z_vals],
y=[str(v) for v in y_vals],
labels={"x": "cluster", "y": "true class", "color": "count"},
title="Contingency matrix for k-means vs ground truth",
)
fig.update_layout(height=360)
fig.show()
# Also show column-normalized view: P(class | cluster)
col_sums = cm.sum(axis=0, keepdims=True)
cm_col_norm = np.divide(cm, col_sums, out=np.zeros(cm.shape, dtype=float), where=col_sums > 0)
fig = px.imshow(
cm_col_norm,
text_auto=".2f",
aspect="auto",
x=[str(v) for v in z_vals],
y=[str(v) for v in y_vals],
labels={"x": "cluster", "y": "true class", "color": "P(class | cluster)"},
title="Column-normalized: how pure is each cluster?",
zmin=0,
zmax=1,
)
fig.update_layout(height=360)
fig.show()
6) Derived scores + label alignment#
The contingency matrix itself is not a single “score”, but you can derive many useful quantities.
Purity#
A simple external clustering score is purity:
\[\text{purity} = \frac{1}{n} \sum_b \max_a N_{a,b}.\]
It measures how dominated each cluster is by its majority class. Purity tends to increase as you increase \(K\) (more clusters) and reaches 1.0 when every point is its own cluster.
“Best-mapped” accuracy#
Because cluster IDs are arbitrary, people sometimes map each cluster to a class and then compute an accuracy-like number.
Majority-vote mapping: map each cluster to its majority class.
One-to-one mapping (when \(K=C\)): choose a permutation that maximizes the total matches.
These are useful for interpretation and benchmarking, but they are not a loss you can optimize with gradients.
def purity_from_contingency(cm):
cm = np.asarray(cm)
return float(np.sum(np.max(cm, axis=0)) / np.sum(cm))
def majority_vote_mapping(cm, true_labels, pred_labels):
"""Map each predicted label to the true label that is most frequent in that cluster."""
cm = np.asarray(cm)
best_true_idx_per_cluster = np.argmax(cm, axis=0)
mapping = {pred_labels[j]: true_labels[i] for j, i in enumerate(best_true_idx_per_cluster)}
return mapping
purity = purity_from_contingency(cm)
mv_map = majority_vote_mapping(cm, y_vals, z_vals)
y_pred_mv = np.array([mv_map[z] for z in labels_km])
acc_mv = float(np.mean(y_pred_mv == y_true))
print("purity:", purity)
print("majority-vote mapping:", mv_map)
print("majority-vote mapped accuracy:", acc_mv)
purity: 0.9981481481481481
majority-vote mapping: {0: 1, 1: 2, 2: 0}
majority-vote mapped accuracy: 0.9981481481481481
# One-to-one label alignment via the assignment problem (Hungarian algorithm)
# Works best when K == C; otherwise it matches only min(C, K) pairs.
row_ind, col_ind = linear_sum_assignment(-cm) # maximize sum of counts
hungarian_map = {z_vals[j]: y_vals[i] for i, j in zip(row_ind, col_ind)}
y_pred_h = np.array([hungarian_map[z] for z in labels_km])
acc_h = float(np.mean(y_pred_h == y_true))
print("Hungarian mapping:", hungarian_map)
print("Hungarian mapped accuracy:", acc_h)
Hungarian mapping: {2: 0, 0: 1, 1: 2}
Hungarian mapped accuracy: 0.9981481481481481
7) Using it for a simple optimization: choosing k for k-means (external criterion)#
The contingency matrix requires ground truth labels, so it is mainly used for:
benchmarking clustering algorithms on labeled datasets
semi-supervised settings (you do have labels)
debugging / sanity checks on synthetic data
A simple way to “optimize” with it is to pick hyperparameters that maximize a derived external score (here: purity).
We’ll compare:
Purity (external, uses y_true)
Inertia / SSE (internal k-means objective, does not use labels)
def purity_score_np(y_true, z_pred):
cm, _, _ = contingency_matrix_np(y_true, z_pred)
return purity_from_contingency(cm)
ks = list(range(2, 9))
purities = []
inertias = []
for k in ks:
labels_k, centers_k, inertia_k = kmeans_np(X, k, n_iters=70, n_init=10, rng=rng)
purities.append(purity_score_np(y_true, labels_k))
inertias.append(inertia_k)
fig = make_subplots(specs=[[{"secondary_y": True}]])
fig.add_trace(go.Scatter(x=ks, y=purities, mode="lines+markers", name="Purity"), secondary_y=False)
fig.add_trace(go.Scatter(x=ks, y=inertias, mode="lines+markers", name="Inertia (SSE)"), secondary_y=True)
fig.update_xaxes(title_text="k (#clusters)")
fig.update_yaxes(title_text="Purity (higher better)", range=[0, 1.05], secondary_y=False)
fig.update_yaxes(title_text="Inertia / SSE (lower better)", secondary_y=True)
fig.update_layout(title="Hyperparameter selection with contingency-derived purity")
fig.show()
# Multiple restarts: internal objective (inertia) vs external purity
k = 3
seeds = rng.integers(0, 1_000_000, size=40)
purity_list = []
inertia_list = []
for s in seeds:
labels_s, centers_s, inertia_s = kmeans_single_run(X, k, n_iters=70, rng=np.random.default_rng(int(s)))
inertia_list.append(inertia_s)
purity_list.append(purity_score_np(y_true, labels_s))
fig = px.scatter(
x=inertia_list,
y=purity_list,
title="Random restarts (k=3): inertia vs purity",
labels={"x": "inertia / SSE", "y": "purity"},
)
fig.show()
8) Pros, cons, and when to use it#
Pros#
Interpretable: a direct count table of overlaps.
Fast: \(\mathcal{O}(n)\) time to build once labels are mapped to indices.
Foundational: many clustering metrics are functions of the contingency matrix.
Flexible: works with any hashable labels (ints, strings, …).
Sparse-friendly: can be represented efficiently when \(C\times K\) is huge but most pairs never occur.
Cons#
Not a scalar score: you usually need an additional reduction (purity, MI, ARI, …).
Requires reference labels (labels_true): not available in purely unsupervised settings.
Easy to game with \(K\) (more clusters) if you turn it into purity/accuracy-like scores.
Comparisons need care: raw counts depend on dataset size and class imbalance.
Good uses#
Debugging clustering pipelines on labeled or synthetic datasets.
Understanding failure modes (which classes get mixed/split).
As an intermediate object to compute robust, permutation-invariant clustering scores.
9) Pitfalls + diagnostics#
Axis confusion: rows correspond to labels_true, columns to labels_pred.
Permutation invariance: cluster IDs are arbitrary; a “different looking” matrix may just be a column permutation.
Imbalance: a large class can dominate counts; inspect normalized views (\(P(\text{cluster}|\text{class})\) and \(P(\text{class}|\text{cluster})\)).
Different numbers of classes/clusters: one-to-one alignment is not well-defined when \(K\neq C\).
Using labels for tuning: choosing hyperparameters by maximizing a contingency-derived score leaks supervision.
eps changes the table: only use it for numerical reasons in downstream log-based metrics.
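On the \(K \neq C\) point: scipy's linear_sum_assignment accepts rectangular matrices, but it returns only min(C, K) matched pairs, leaving some clusters (or classes) unmapped. A small sketch with made-up counts:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# 2 true classes, 4 clusters: a one-to-one mapping cannot cover every cluster
cm = np.array([
    [5, 0, 3, 0],
    [0, 4, 0, 2],
])

row_ind, col_ind = linear_sum_assignment(-cm)  # maximize matched counts
print(row_ind, col_ind)  # only min(C, K) = 2 pairs are matched

matched = cm[row_ind, col_ind].sum()
print(matched, "of", cm.sum(), "points covered by a one-to-one mapping")
```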
10) Exercises#
Implement a dense contingency matrix builder using only np.bincount (hint: flatten 2D indices).
For the synthetic dataset, compare purity to a permutation-invariant score like ARI or NMI.
Show how purity behaves as you let \(k\) grow all the way to the number of points.
Implement a one-to-one alignment using brute-force permutations for \(K\leq 7\) and compare to the Hungarian solution.
References#
scikit-learn docs: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cluster.contingency_matrix.html